Skip to content

Review fixes for #715: drop scope creep, dead code, redundant test; dedup row-major LDE#741

Merged
jotabulacios merged 5 commits into
perf/cpu-lde-rework_gpufrom
pr-715-fixes
Jun 29, 2026
Merged

Review fixes for #715: drop scope creep, dead code, redundant test; dedup row-major LDE#741
jotabulacios merged 5 commits into
perf/cpu-lde-rework_gpufrom
pr-715-fixes

Conversation

@MauroToscano

@MauroToscano MauroToscano commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Review fixes for #715 (row-major GPU LDE). Targets the PR branch so they fold into #715. Net cleanup, −~150 lines.

Changes

  • Revert unrelated FxHashMap op-dedup change (scope creep). The FxHasher/FxHashMap micro-optimization in types.rs + branch/dvrm/lt/mul.rs is unrelated to the LDE rework and was only applied to 4 of 6 dedup tables (eq.rs/bytewise.rs left on std HashMap). Reverted to std::collections::HashMap; it can land as its own focused PR.
  • Remove redundant gpu_and_cpu_proofs_both_verify test. The GPU full path is covered by the normal prove/verify suite built with --features cuda (plus gpu_path_fires_end_to_end), the CPU path by the non-cuda suite, and GPU/CPU equivalence by the merkle_root_parity / barycentric_cpu_gpu_parity tests. Its "force CPU" leg also never ran on CPU: gpu_lde_threshold() only re-read the env var under #[cfg(test)], but from the prover integration-test crate stark compiles without cfg(test), so the OnceLock cached the first value and the second set_var was a no-op. Also simplified gpu_lde_threshold() to a single cached impl now that the per-call re-read has no consumer.
  • Fix stale docs / dead code. Moved keccak256_leaves_base_row_major out of keccak_merkle_level's doc block (it had split the "child pair → one parent" comment away from its kernel). Deleted columns_to_row_major in prover.rs — zero callers after the row-major path stopped materializing GPU-expanded columns.
  • Guard row-major keccak. Added debug_assert!(num_rows >= 2) to launch_keccak_base_row_major; the kernel computes __brevll(tid) >> (64 - log_num_rows), which is UB at num_rows == 1. Matches the existing guard in launch_keccak_base.
  • Consolidate the row-major LDE pipeline. Extracted coset_lde_row_major_inner, shared by the base and ext3 _keep entry points (they differed only by m vs m * 3 and the handle type), removing ~110 lines of drift-prone duplication.
  • Fix stale assertion in gpu_path_fires_end_to_end. It asserted gpu_parts_lde_calls() > 0 with a comment claiming branch/shift tables are degree-3 — both false: fib_iterative_1M tables all have number_of_parts <= 2, and the common degree-2 case fires the fused two-halves path (gpu_extend_halves_calls), counted separately from the parts > 2 path since perf(stark): fuse composition half-extension onto coset_lde_full (precomputed twiddles) #700. Now asserts on the sum so either composition-LDE path satisfies it.

Verification

GPU-validated on RTX 5090 / CUDA 13.1:

  • make test-math-cuda78/78 passed, including merkle_root_parity (base + ext3) and barycentric_cpu_gpu_parity — confirms the coset_lde_row_major_inner consolidation is behavior-preserving.
  • make test-cuda-integrationgreen (gpu_path_fires_end_to_end: GPU dispatches fire, proof verifies).
  • make bench-prover-cudaprove(fib_iterative_1M) = 5.89s, GPU LDE path fired.

Host checks: cargo check + cargo clippy -p lambda-vm-prover clean; all edited Rust files rustfmt-clean; the keccak.cu change is a pure reorder (added lines == removed lines, no logic change).

The FxHasher/FxHashMap op-dedup micro-optimization is unrelated to the row-major GPU LDE rework and was only applied to 4 of 6 dedup tables. Revert the table maps to std HashMap and drop the hasher; it can land as its own focused PR.
The GPU full path is covered by the normal prove/verify suite built with --features cuda (plus gpu_path_fires_end_to_end), the CPU path by the non-cuda suite, and GPU/CPU equivalence by the merkle/barycentric parity tests. Its force-CPU leg also never ran on CPU: gpu_lde_threshold() only re-read the env var under cfg(test), but from the prover integration crate stark compiles without cfg(test), so the OnceLock cached the first value. Simplify gpu_lde_threshold() to a single cached impl now that the per-call re-read has no consumer.
keccak.cu: move keccak256_leaves_base_row_major out of keccak_merkle_level's doc block so the child-pair->parent doc rejoins its kernel. prover.rs: delete columns_to_row_major, which has no callers after the row-major GPU path stopped materializing GPU-expanded columns.
Extract coset_lde_row_major_inner shared by the base and ext3 _keep entry points (they differed only by m vs m*3 and the handle type), removing ~110 lines of drift-prone duplication. Add debug_assert!(num_rows >= 2) to launch_keccak_base_row_major: the kernel shifts by (64 - log_num_rows), UB at num_rows==1, matching the guard in launch_keccak_base.
The assert checked gpu_parts_lde_calls() > 0 with a comment claiming branch/shift tables are degree-3 — both false: fib_iterative_1M tables all have number_of_parts <= 2, and the common degree-2 case fires the fused two-halves path (gpu_extend_halves_calls), counted separately from the parts>2 path (gpu_parts_lde_calls) since #700. Assert on the sum so either composition-LDE path satisfies it. Validated on RTX 5090 / CUDA 13.1: make test-math-cuda 78/78, make test-cuda-integration green, proof verifies.
@jotabulacios jotabulacios merged commit 4bb4798 into perf/cpu-lde-rework_gpu Jun 29, 2026
11 checks passed
@jotabulacios jotabulacios deleted the pr-715-fixes branch June 29, 2026 21:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants